The following material has been adapted from the 2023 Swiss Institute of Bioinformatics single cell RNA sequencing course (https://sib-swiss.github.io/single-cell-training/) authored by Tania Wyss, Rachel Marcone-Jeitziner, Geert van Geest, and Patricia Palagi
library(Seurat)
library(scater)
library(ggplot2)
library(dittoSeq)
load("unfiltered_seu_pbmmc_dataset.RData")
###############################################################################
One of the first plots that we often use is a violin plot of feature/gene counts across all of the samples. This can be done using the seurat VlnPlot function and can show if there are problematic patterns within the data (eg low counts or genes in one or more samples).
Generally for Seurat functions, the “features” argument selects numeric variables for plotting while the “group.by” argument groups data points according to a factor variable. By default Seurat, will group samples by the “active.ident”
# this code produces the same result
VlnPlot(seu,features = "nCount_RNA")
# as this code
VlnPlot(seu,features = "nCount_RNA",group.by = "orig.ident")
# plotting RNA and feature (gene) counts simultaneously using the c() function
VlnPlot(seu,features = c("nCount_RNA","nFeature_RNA"),group.by = "orig.ident")
# Plot the three major qc metrics together
Seurat::VlnPlot(seu, features = c("percent.mito",
"nCount_RNA",
"nFeature_RNA"))
PBMMC-2 has higher counts and more cells. Mitochondrial percentage seems to be similar across samples and is low overall (probably much lower than in your own data). This suggests that there are not high numbers of damaged cells.
Histograms can be a quick and easy way of checking the spread of counts and features across all cells. Sometimes the peaks/troughs on a histogram can help in setting quality control thresholds (eg minimum numbers of counts)
# exercise - plot a histogram of the counts using the hist() function
hist(seu$nFeature_RNA)
# separating the histogram into more bars using the breaks argument can help visualisation
hist(seu$nFeature_RNA,breaks = 200)
# now let's do the same for counts
hist(seu$nCount_RNA,breaks = 200)
Scatter plots provide the ability to look at multiple qc metrics at the same time. The main things to look out for are groups of cells that have low counts AND low features (possibly background/ambient RNA rather than cells) as well as cells that have unusually high counts AND high features (possible doublets). The Seurat FeatureScatter function is used here with cells coloured based on sample. The function also returns the Pearson correlation between the two features above the plot
Seurat::FeatureScatter(seu, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
Seurat::FeatureScatter(seu, feature1 = "nCount_RNA", feature2 = "percent.mito")
It is possible to do a scatterplot of two metrics coloured by a third metric using ggplot. These three-way plots can show cells with low counts and features, and high mitochondrial percentage. Examples using other QC metrics are in the appendix at the end of this document.
ggplot(data = seu[[]]) +
geom_point(mapping = aes(x= nCount_RNA, y= nFeature_RNA, color= percent.mito),
size=0.5, position= 'jitter') +
scale_color_gradientn(colors = c('blue', 'red')) + theme_bw()
# you can save the plots easily using the ggsave function
ggsave("scatterplot_nCount_nFeature_percent_mito.png")
## Saving 7 x 5 in image
A stacked barplot can be helpful for visualising proportions. Here we use it for cell cycle phase but you can also use it later for looking at the proportion of samples within each cluster downstream
ggplot(seu[[]], aes(fill= as.factor(seu$Phase), x= orig.ident))+
geom_bar(position="fill") +
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0.5)) +
labs(fill="Phase")
You can do a similar plot using the dittoseq package as well
dittoSeq::dittoBarPlot(seu, var = "Phase", group.by = "orig.ident")
Until now we have focused only on the major QC metrics. Let’s plot the other QC metrics to see if there are any differences between the samples
Seurat::VlnPlot(seu, features = c("percent.mito",
"percent.ribo",
"percent.globin"))
PBMMC-2 looks different to the other samples for globin and ribosomal %
Let’s see if there is a relationship between globin and ribosomal %
Seurat::FeatureScatter(seu,
feature1 = "percent.globin",
feature2 = "percent.ribo")
A lot of cells from PBMMC-2 have high haemoglobin and low ribosomal genes. These are likely to be erythroid cells.
A scatterplot coloured by the percent.top50 shows that the cells are low complexity as well - further evidence that they are likely to be erythroid cells
ggplot(data = seu[[]]) +
geom_point(mapping = aes(x = percent.globin , y = percent.ribo, color = percent.top_50),
size=0.5, position = 'jitter') +
scale_color_gradientn(colors = c('blue', 'red')) + theme_bw()
The first consideration is whether or not you are interested in erythroid cells. If you are interested in them, then you keep them in. If you aren’t interested, then you could consider applying a filtering threshold based on percent.globin. However, there is unlikely to be an issue with keeping them in as they will most likely separate out into their own cluster on the UMAP plot. We will keep them in for now and perform filtering based on the major QC metrics.
We have all of the filtering metrics required to set our threshold but first it is a good idea to make a copy of the unfiltered Seurat object so that we don’t need to run all of the code above again to test separate filtering thresholds.
# We make a new copy of the seu object called unfiltered_seu
unfiltered_seu <- seu
Filtering can be done by subsetting the Seurat object. Thresholds are extremely subjective and very much dataset specific (different mitochondrial and count thresholds). As long as you choose thresholds that are reasonable and can explain them then you will be okay.
# These are the filtering thresholds used in the actual publication with the
# addition of nFeature_RNA < 5000 to control for possible doublets
seu <- subset(unfiltered_seu, subset = nFeature_RNA > 200 &
nFeature_RNA < 5000 &
percent.mito < 8)
# Check size of unfiltered object
unfiltered_seu
## An object of class Seurat
## 18673 features across 6946 samples within 1 assay
## Active assay: RNA (18673 features, 0 variable features)
# Check size of filtered object
seu
## An object of class Seurat
## 18673 features across 6830 samples within 1 assay
## Active assay: RNA (18673 features, 0 variable features)
You can add invert=TRUE into the subset function above to make a Seurat object just of the cells that you are removing. Can help to show if filtering is uneven across samples (eg many cells being removed from one sample but not others)
Outside of visually choosing thresholds, you can also use statistical approaches to identify outliers. An example is the isOutlier function from the scuttle package. The function’s default is to identify a threshold of 3 median absolute deviations. In the examples below we find a mitochondrial threshold and a nFeature_RNA threshold. The ‘type’ argument changes because we only want to remove cells with high mitochondrial content but for nFeature_RNA we want to remove those with both higher (doublets) and lower (low quality/empty cells) feature counts.
# library(scuttle) # scuttle was loaded with scater earlier
# Find a mitochondrial threshold
mito_outlier <- isOutlier(seu$percent.mito,type = "higher")
attr(mito_outlier,"thresholds")
## lower higher
## -Inf 9.330935
# Find nFeature_RNA thresholds
features_outlier <- isOutlier(seu$nFeature_RNA,type = "both",log = T)
attr(features_outlier,"thresholds")
## lower higher
## 200.3846 5397.6207
save(seu,file="filtered_seu_pbmmc_dataset.RData")
#########################################################################################
#########################################################################################
# histogram with title and xlabel
hist(seu$nFeature_RNA,breaks = 200,main = "pbmmc feature counts",xlab = "feature counts")
# histogram with only lower feature counts (looking for lower cutoff)
hist(seu$nFeature_RNA[seu$nFeature_RNA<2000],breaks = 200,
main = "pbmmc feature counts",xlab = "feature counts")
# can use the abline function to add a line to a plot
abline(v = 500, col = "red")
# Here we are plotting the lower UMI counts
hist(seu$nCount_RNA[seu$nCount_RNA<4000],breaks = 200,main = "pbmmc UMI counts",xlab = "UMI counts")
Sometimes it can be more helpful visually to plot transformed counts Changing the base in the log function will spread/compress the counts
seu$log_nCountRNA <- log(seu$nCount_RNA,base=2)
seu$log_nFeatureRNA <- log(seu$nFeature_RNA,base=2)
VlnPlot(seu,features = c("nCount_RNA","log_nCountRNA"))
VlnPlot(seu,features = c("nFeature_RNA","log_nFeatureRNA"))
# Globin percentage, ribosomal percentage, top50
ggplot(data = seu[[]]) +
geom_point(mapping = aes(x = percent.globin , y = percent.ribo, color = percent.top_50),
size=0.5, position = 'jitter') +
scale_color_gradientn(colors = c('blue', 'red')) + theme_bw()
# Globin ribo nFeature
ggplot(data = seu[[]]) +
geom_point(mapping = aes(x = percent.globin , y = percent.ribo, color = nFeature_RNA),
size=0.5, position = 'jitter') +
scale_color_gradientn(colors = c('blue', 'red')) + theme_bw()
# Count feature percent.mito
ggplot(data = seu[[]]) +
geom_point(mapping = aes(x = nCount_RNA , y = percent.mito, color = nFeature_RNA),
size=0.5, position = 'jitter') +
scale_color_gradientn(colors = c('blue', 'red')) + theme_bw()
# ggplot coloured using the viridis colour scheme (requires viridis package)
# install.packages("viridis")
# library(viridis)
ggplot(data = seu[[]]) +
geom_point(mapping = aes(x = nCount_RNA , y = nFeature_RNA, color = percent.mito),
size=0.5, position = 'jitter') +
scale_color_viridis_c() + theme_bw()
Another quality control step is to look at the genes that have the highest expression in the dataset. This can give an indication of the cell types that have been captured and give an indication of damaged cells as well. Unfortunately there is no function in Seurat for this so here is one that has been prepared earlier.
most_expressed_boxplot <- function(object, ngenes = 20){
# matrix of raw counts
cts <- Seurat::GetAssayData(object, assay = "RNA", slot = "counts")
# get percentage/cell
cts <- t(cts)/colSums(cts)*100
medians <- apply(cts, 2, median)
# get top n genes
most_expressed <- order(medians, decreasing = T)[ngenes:1]
most_exp_matrix <- as.matrix((cts[,most_expressed]))
# prepare for plotting
most_exp_df <- stack(as.data.frame(most_exp_matrix))
colnames(most_exp_df) <- c("perc_total", "gene")
# boxplot with ggplot2
boxplot <- ggplot(most_exp_df, aes(x=gene, y=perc_total)) +
geom_boxplot() +
coord_flip()
return(boxplot)
}
# run the function on the seu object with your desired number of genes
most_expressed_boxplot(seu, 20)
MALAT1 is a long non-coding RNA that is very commonly highly expressed in single-cell data. It is generally inversely correlated with cell quality so often researchers remove it but it can also have a biological role. There are also multiple ribosomal genes because ribosomes are found across most cell types. There are also HBB and HBA2 which are suggestive of erythroid cells.
################################################################################